CMPINF 2100 Fall 2021 - Week 08

PCA with the Sonar data set

Sonar data set

Explore the continuous features.

Let's use separate subplots for every column of the numeric features.

Use the default wide-format plotting methods in Seaborn.

Make a Seaborn figure with a separate histogram facet for each variable.

Reshape from wide to long-format to be able to make use of facets.
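
The wide-to-long reshape can be sketched with `DataFrame.melt()`. This is a minimal sketch on a small synthetic stand-in for the Sonar features; the column names `x1` to `x4`, `rowid`, `variable`, and `value` are assumptions for illustration, not necessarily the names used in the lecture.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2100)

# Hypothetical stand-in for the Sonar data: 4 numeric columns plus a response.
wide_df = pd.DataFrame(rng.uniform(0, 1, size=(6, 4)), columns=["x1", "x2", "x3", "x4"])
wide_df["response"] = ["R", "M", "R", "M", "R", "M"]

# Keep a row identifier so each observation can still be traced after reshaping.
long_df = (
    wide_df.reset_index()
    .rename(columns={"index": "rowid"})
    .melt(id_vars=["rowid", "response"], var_name="variable", value_name="value")
)

print(long_df.head())
```

Each original row now appears once per feature, so the long table has (number of rows) x (number of features) records.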

To get greater control over the facets we need to use sns.FacetGrid().

Boxplots make it easier to directly compare the summary statistics. Examine the summary statistics of each continuous variable, ignoring the categorical response.

Make a boxplot comparing the variable distribution summary statistics grouped by response.
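
A grouped boxplot might look like the sketch below. It assumes long-format synthetic data with hypothetical `variable`, `value`, and `response` columns, and uses `hue` to split each variable by response.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2100)

# Synthetic stand-in: 3 numeric features and a two-level response.
features = ["x1", "x2", "x3"]
df = pd.DataFrame(rng.uniform(0, 1, size=(60, 3)), columns=features)
df["response"] = np.tile(["M", "R"], 30)

long_df = df.melt(id_vars="response", var_name="variable", value_name="value")

# One pair of boxes per variable, colored by the response group.
ax = sns.boxplot(data=long_df, x="variable", y="value", hue="response")
ax.set_title("Feature distributions grouped by response")
```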

Focus just on the average value of each variable grouped by response. The point plot shows the average and the confidence interval on the average.
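
The quantities behind a point plot can be approximated by hand. The sketch below computes the group means and a normal-approximation 95% interval (mean plus or minus 1.96 standard errors) on synthetic data with hypothetical column names. Seaborn's point plot bootstraps its interval by default, so the plotted bounds will differ slightly from these.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2100)

features = ["x1", "x2", "x3"]
df = pd.DataFrame(rng.uniform(0, 1, size=(60, 3)), columns=features)
df["response"] = np.tile(["M", "R"], 30)
long_df = df.melt(id_vars="response", var_name="variable", value_name="value")

# Mean and standard error of the mean for every (variable, response) pair.
summary = (
    long_df.groupby(["variable", "response"])["value"]
    .agg(["mean", "sem"])
    .reset_index()
)

# Normal-approximation interval; an assumption, not Seaborn's bootstrap.
summary["ci_lower"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_upper"] = summary["mean"] + 1.96 * summary["sem"]

print(summary)
```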

Variable relationships

Use the correlation plot to examine which variables are related to each other.
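
One common way to draw a correlation plot is a heatmap of the `.corr()` matrix. This is a sketch on synthetic data in which the hypothetical column `x2` is deliberately built to correlate with `x1`; the color settings are illustrative choices.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2100)

df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
# Make x2 strongly related to x1 so the plot has visible structure.
df["x2"] = df["x1"] + 0.3 * rng.normal(size=100)

corr = df.corr()

# Fix the color scale to [-1, 1] and center it at 0 so the sign is readable.
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm", square=True)
```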

Does the correlation structure change across groups? Look at the correlation plot between the continuous features for each value of response.

What does enumerate() do?
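
`enumerate()` pairs each item of an iterable with a running counter, which is handy when looping over groups and needing a position (for example a subplot index) at the same time. A minimal sketch, using hypothetical response labels:

```python
groups = ["M", "R"]  # hypothetical response labels

# Each iteration yields (counter, item); the counter starts at 0 by default.
for position, label in enumerate(groups):
    print(position, label)
# prints:
# 0 M
# 1 R

# The optional start argument changes the first counter value.
pairs = list(enumerate(groups, start=1))
print(pairs)
```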

Apply the .corr() method after grouping by response.
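
A sketch of the grouped correlation, assuming a DataFrame with a `response` column and hypothetical feature names. The result has a two-level row index (group, feature), so one group's matrix can be pulled out with `.loc`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2100)

features = ["x1", "x2", "x3"]
df = pd.DataFrame(rng.normal(size=(80, 3)), columns=features)
df["response"] = np.tile(["M", "R"], 40)

# One correlation matrix per response value, stacked into a single DataFrame.
corr_by_group = df.groupby("response")[features].corr()

# Extract the matrix for a single group.
corr_M = corr_by_group.loc["M"]
print(corr_M.round(2))
```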

PCA

Principal Components Analysis (PCA) exploits correlation between variables. Each new variable (a principal component) is a linear combination of all of the original variables.

Before running PCA, we should standardize the variables so that variables with larger scales do not dominate the components.

Standardize the continuous variables.

Check that all variables have roughly the same range.
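
Standardizing and then checking the ranges could be sketched as follows, assuming scikit-learn's `StandardScaler` and synthetic features on deliberately different scales. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so the `std` reported by `.describe()` is close to, but not exactly, 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2100)

# Hypothetical features whose raw ranges differ by orders of magnitude.
features = ["x1", "x2", "x3"]
df = pd.DataFrame(rng.uniform(0, [1, 10, 100], size=(100, 3)), columns=features)

# Subtract each column's mean and divide by its standard deviation.
X_std = StandardScaler().fit_transform(df[features])
std_df = pd.DataFrame(X_std, columns=features)

# After standardizing, every column should have a similar range.
print(std_df.describe().loc[["mean", "std", "min", "max"]].round(3))
```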

To apply PCA we need to first initialize the object, then fit, and finally transform.
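
The three steps can be sketched with scikit-learn's `PCA`. The data here are synthetic, and the `pc01`, `pc02`, ... column names are a hypothetical labeling convention.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2100)

# Synthetic stand-in with some correlation between the first two columns.
X = rng.normal(size=(100, 5))
X[:, 1] += X[:, 0]
X_std = StandardScaler().fit_transform(X)

pca = PCA()                    # 1. initialize (keeps all components by default)
pca.fit(X_std)                 # 2. fit: learn the component directions
scores = pca.transform(X_std)  # 3. transform: project the data onto the components

pc_df = pd.DataFrame(
    scores, columns=[f"pc{i + 1:02d}" for i in range(scores.shape[1])]
)
print(pc_df.head())
```

`pca.fit_transform(X_std)` combines steps 2 and 3 into a single call.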

Examine the distributions of the Principal Component scores via boxplots.

The variance of the scores decreases from one PC to the next.

All of the PCs are uncorrelated.
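
Both properties can be checked numerically. The sketch below, on synthetic data, compares the sample variance of each score column with `explained_variance_` and confirms that the off-diagonal correlations between score columns are zero up to floating-point error.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2100)

X = rng.normal(size=(100, 5))
X[:, 1] += X[:, 0]
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)

# Sample variance of each PC score column (ddof=1 matches scikit-learn).
score_var = scores.var(axis=0, ddof=1)
print(score_var.round(3))

# Correlation between PC score columns; off-diagonal entries should be ~0.
score_corr = np.corrcoef(scores, rowvar=False)
off_diag = score_corr - np.eye(score_corr.shape[0])
print(np.abs(off_diag).max())
```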

We can decide the number of PCs to focus on. There are several heuristics to help us identify the smallest number of PCs to use. One such approach is to look for the knee (bend) in the scree plot.

The scree plot shows the fraction of the variance explained per PC.
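
A scree plot can be drawn from the fitted object's `explained_variance_ratio_` attribute. This sketch uses synthetic data generated from two hypothetical latent factors so that a knee is visible; it is not the Sonar result.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2100)

# 8 observed features driven by 2 latent factors plus noise (assumed structure).
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 8))
X = latent @ loadings + 0.3 * rng.normal(size=(200, 8))

pca = PCA().fit(StandardScaler().fit_transform(X))
ratio = pca.explained_variance_ratio_

# Plot the fraction of variance explained against the PC number.
pc_number = np.arange(1, len(ratio) + 1)
fig, ax = plt.subplots()
ax.plot(pc_number, ratio, marker="o")
ax.set_xlabel("PC")
ax.set_ylabel("Fraction of variance explained")
```

The knee is where the curve flattens out; PCs after that point each add little explained variance.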

From the scree plot it looks like 10 PCs are useful.

Since we now have 10 variables instead of 60, we can use the pair plot.

Let's examine the first 10 PCs grouped by response.